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Abstract 

Independent Component Analysis (ICA), a well-known approach in statistics, assumes that 
data is generated by applying an affine transformation of a fully independent set of random 
variables, and aims to recover the orthogonal basis corresponding to the independent random 
variables. We consider a generalization of ICA, wherein the data is generated as an afhnc 
transformation applied to a product of distributions on two orthogonal subspaces, and the 
goal is to recover the two component subspaces. Our main result, extending the work of Frieze, 
Jerrum and Kannan, is an algorithm for generalized ICA that uses local optima of high moments 
and recovers the component subspaces. When one component is on a fc-dimensional "relevant" 
subspace and satisfies some mild assumptions while the other is "noise" modeled as an (n — k)- 
dimensional Gaussian, the complexity of the algorithm is T(fc, e) + poly(n) where T depends 
only on the fc-dimensional distribution. We apply this result to learning a k-subspace junta, i.e., 
an unknown 0-1 function in R n determined by an unknown fc-dimensional subspace. This is a 
common generalization of learning a fc-junta in R n and of learning an intersection of k halfspaces 
in K", two important problems in learning theory. 

Our main tools are the use of local optima to recover global structure, a gradient-based 
algorithm for optimization over tensors, and an approximate polynomial identity test. Together, 
they significantly extend ICA and the class of fc-dimensional labeling functions that can be 
learned efficiently. 



1 Introduction 



Independent Component Analysis (ICA) |25j is a statistical approach that models data in M n 
as generated by a distribution consisting of n linear combinations of n independent univariate 
component distributions, y = Ax with x,y £ W 1 , X{ are independent random variables and A is an 
invertible n x n matrix; in other words, an affine transformation of a product distribution. The goal 
is to recover the underlying component distributions of the X{ given only a set of observations y. 
Special cases of ICA are of interest in many application areas with large or high-dimensional data 
sets [24j- An important feature of ICA, as we will presently see, is that it can provide an insightful 
representation even when Principal Component Analysis (PCA) does not. 

In this paper, we consider generalized ICA, where instead of n independent one-dimensional 
distributions, we only assume two independent distributions on complementary subspaces. This 
natural extension of ICA provides a common generalization of two fundamental problems in high- 
dimensional learning, where one sees labeled points (examples) from an unknown distribution la- 
beled by an unknown 0-1 function and the goal is to find a labeling function that agrees on most of 
the distribution [3D]. The first, introduced by A. Blum [8], is learning a function of k coordinates 
in M. n , known as a fc-junta. The second is the problem of learning an intersection of k halfspaces 
in M n [BJ [7] (k = 1 is the classic problem of learning a halfspace). Although the complexity of 
both problems is far from settled, there has been much progress in recent years for special cases, 
as we discuss in Section 11.11 Indeed, generalized ICA can be applied to the problem of learning 
an unknown function of an unknown /c-dimensional subspace of R n , provided the distribution on 
points can be factored into independent distributions on the fc-dimensional "relevant" subspace and 
the (n — /c)-dimensional "noise" subspace. 

We give an algorithm for generalized ICA that can be viewed as a tensor version of PCA 
applied to higher moments, specifically local optima of moment functions to infer the component 
distributions. The algorithm uses a second-order gradient descent method and an approximate 
version of the Schwartz-Zippel polynomial identity test, while its analysis needs tools from convex 
geometry and probability. Before we describe our results and techniques in detail, we summarize 
the known algorithmic approaches to ICA. 

For the problem of identifying the source components given only their linear combinations as 
data, PCA suggests the approach of using principal components of the data as candidates for the 
component directions. This would indeed recover the components if the covariance matrix of the 
data has distinct nonzero eigenvalues. However, if variances along two or more directions are equal, 
then the principal components are not uniquely defined and PCA does not work. In more detail, 
assume that the data is centered, i.e., its mean is zero. Then PCA can be viewed as finding vectors 
on the unit sphere that are local optima of the second moment of the projection, max^ggn-i ||^4x|| 2 
where A is m x n with each row being a data point. These maxima are eigenvalues of A T A, the 
covariance matrix of A and hence attain at most n distinct values. The values and the corresponding 
vectors can be approximated to arbitrary accuracy efficiently. 

What to do when eigenvalues are repeated? To address this, the idea in ICA is to consider a 
broader class of functions to optimize. A natural choice is higher moments. The use of local optima 
of fourth moments was suggested as early as 1991 |301 115], When the component distributions are 
sufficiently far from being Gaussian, the local optima of a family of functions on the unit sphere 
are the component directions |371 [T7] (if component distributions are Gaussians, then their linear 
combinations are also Gaussian and the linear transformation A might not uniquely defined). This 
approach can be turned into an a polynomial-time algorithm for unraveling a product distribution 
of a wide class of one-dimensional distributions. 
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We now describe generalized ICA, which significantly weakens the ICA assumption of a full 
product distribution. Namely, we assume that the distribution F in M n can be factored into a 
product of two independent marginal distributions Fy and Fyy on unknown orthogonal subspaces 
V and W = V- 1 , i.e., F = FyFyy- We call such an F factorizable. Thus, a random point in F 
is generated by first picking its coordinates in V according to Fy and then independently picking 
coordinates in W according to Fyy- The corresponding problem is the following. 

Problem 1 (Factoring distributions). Given (unlabeled) samples from a factorizable distribution 
F = FyFw over R n (with V and W unknown), recover a factorization of F. 

If F in fact factorizes further into product of more distributions, or even a full product distri- 
bution of one-dimensional component distributions as in ICA, an algorithm for the above problem 
can be applied recursively to find the full factorization. We will give an algorithm for this problem 
under further mild assumptions (roughly speaking, at least one of Fy , Fyy is sufficiently different 
from being a Gaussian). Our approach is based on viewing PC A as a second moment optimization 
problem, then extending this to higher moments (alternatively, optimization over tensors). Al- 
though such tensor optimization is intractable in general, for our setting, it will turn out that local 
optima provide valuable information, and can be approximated efficiently. 

The factoring problem above has direct applications to learning in high dimension. Let ny 
denote projection to a subspace V. We consider labeling functions £ : W 1 — > {0, 1} of the form 
£{x) = l(iry(x)). We are given points according to some distribution F over M. n along with their 
labels £{x) = £(iry(x)) for some unknown subspace V of dimension k (the 'relevant' subspace), and 
wish to learn the unknown concept £, i.e., find a function that agrees with £ on most of F. We 
call this the problem of learning a k-subspace junta. We further assume that F is factorizable as 
F = FyFw, with W = V 1 - (the 'irrelevant' subspace). The justification for this factorizability 
assumption is that coordinates in the W subspace are not relevant to the labeling function and can 
be considered to be noisy attributes. The full statement of our learning problem is as follows: 

Problem 2 (Learning a fc-subspace junta). For e, 5 > 0, given samples drawn from a factorizable 
distribution F = FyF\y, and labeled by a £ = £oiry, find a 0-1 function f such that with probability 
at least 1 — 5, 

Pv(£(x) + fix)) < e. 

F 

Our algorithm for generalized ICA leads to an efficient algorithm for learning /c-subspace juntas 
for a large class of ambient distributions F. 

1.1 Related work 

Jutten and Herault formalized the ICA problem [25] and mention in their paper that variants of this 
problem had appeared in a variety of different fields prior to this (the earliest such mention is in [3] ) . 
The notion that random variables should be far from being Gaussian pervades ICA research. By 
the central limit theorem, sums of independent random variables converge to a Gaussian, whereas 
individually the latent random variables are not Gaussian. Thus finding directions that maximize 
some notion of non-Gaussianity might reveal the latent variables. This intuition is formalized by 
introducing functions which serve as a proxy for non-Gaussianity, called "contrast functions" in the 
ICA literature. The definition of a contrast function is that maximizing a contrast function will give 
an independent component. Some examples of contrast functions include the kurtosis (4th order 
analogue of variance) [30 , 15j, various cumulants, and functions based on the so-called negentropy 
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(|16j). Additionally, there are a variety of tensor methods and maximum likelihood methods used 
flB"! [5]. While there are many algorithms proposed for ICA, some of which appear to perform 
well in practice (e.g., FastICA [23]), there are almost no explicit time complexity bounds. Frieze, 
Jerrum and Kannan [19] were the first to give a polynomial complexity bound for this special case 
of ICA, namely a product of uniform distributions on intervals, which can also be viewed as the 
problem of learning an unknown parallelopiped from samples. They used fourth moments, an idea 
presented earlier in several papers in the ICA literature; the key structural lemma is already present 
in [17], which was inspired by [37J (Lemma [3] of our paper is a generalization). Subsequently, Nguyen 
and Regev [33] simplified Frieze et al's gradient descent algorithm and provided some cryptographic 
applications. 

A different motivation for our work comes from computational learning theory, where learning 
a k- junta is a fundamental problem [8]. In this problem, one is given points from some distribution 
over {0, l} n , labeled by a Boolean function that depends only on k of the n coordinates. The goal is 
to learn the relevant k coordinates and the labeling function. Naive enumeration of k subsets of the 
coordinates leads to an algorithm of complexity roughly n k . Mossel et al [32j gave an algorithm of 
complexity roughly 0(n°' 7fc ) assuming the uniform distribution over {0, l} n . For other special cases 
of Problem O previous authors have applied standard low-dimensional representation techniques, 
low-degree polynomials, random projection and Principal Component Analysis (PCA) to identify V 
under strong distributional assumptions [H [27] El RTf] H3] . The strongest result in this line achieves 
a fixed polynomial dependence on n by applying PCA to learn convex concepts over Gaussian 
input distributions [42j . Unfortunately, standard PCA does not work for other distributions or 
more general concept classes, in part because PCA does not provide useful information when the 
covariance matrices of the positive and negative samples are equal. In fact, the problem appears 
to be quite hard with no assumptions on the input distribution, even for small values of k, e.g., a 
single halfspace can be PAC-learned via linear programming, but learning an intersection of two 
halfspaces (a 2-subspace junta) in polynomial time is an open problem. 

There have been a number of extensions of PCA to tensors [29] analogous to SVD, although no 
method is known to have polynomial complexity. One approach is to view PCA as an optimization 
problem. The top eigenvector is the solution to a matrix optimization problem: 

max v T Av = yX^)h,i2 v h v i2 
||d||=1 

where A is the covariance matrix. A higher moment method optimizes the multilinear form defined 
by the tensors of higher moments: 

max A(v, ...,v) = V" A^^v^ . . . v ir . 

Unlike the bilinear case, finding the global maximum of a multilinear form is hard. For a > 16/17, 
it is NP-hard to approximate the optimum to better than factor a^/ 4 ! [Jo] ) anc l the best known 
approximation factor is roughly n r ' 2 . Several local search methods have been proposed for this 
problem as well [28 j. 

1.2 Results 

To state our results formally, we need to define the distance of a distribution from a Gaussian via 
moments. For a random vector x G W 1 with distribution F, the m th moment tensor M m is a tensor 
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of order m with n m entries given by: 

M™_ im =E( Xil ...x im ). 

Let r n be the standard Gaussian distribution over W 1 and j m denote the m th moment of a standard 
Gaussian random variable: j m = (m — 1)!! when m is even and when m is odd. 
The m th -moment distance of two distributions F, G over R n is defined as 

d m (F, G) = max |E F ({x T u) m ) - E G ((x T u) m ) I = \\Mf - M%\\ 2 . 

\\u\\=l 

We say that a distribution F over R fc is (m, i])- moment- distinguishable along unit vector u G R fc , if 
either there exists j < m: 

|E F ((ar T u) J ') -7j| > 7/ 

or there exist unit vectors {v i, . . . , vt} C u 1 - where t < m such that 

|E F ((x T n) m -*n* =1 (x T ^)) - E F ((x T n) m '*) E (n* =1 (x T ^))| > V . 

In words, F differs from a Gaussian either along some direction u, or by exhibiting a correlation 
between its marginal along u and vectors orthogonal to u (for a Gaussian such subsets have zero 
correlation) . The rationale for this definition is that if two continuous distributions are identical (or 
close) in many moments, then one would expect them to be close in L 1 distance. For example, the 
following holds for one-dimensional logconcave distributions via an explicit bound on the number 
of moments required. 

Lemma 1 (L 1 distance from Gaussian). Fix m and e > 0. Let f : R — > K be an isotropic logconcave 
density, whose first m moments satisfy \Ef (x m ) — 7 m | < e, then: 

\\f-9\\ 1 <(^T 8 + c'me^) 1/2 lo g m < c + eme^ 

We are now ready to state our first main result: we can efficiently factorize distributions assum- 
ing the distribution on the relevant subspace is moment-distinguishable and the distribution on the 
irrelevant noisy attributes is some Gaussian. In what follows, it might be illustrative to regard k 
as a constant independent of n. Let C F (n, m, e) be the number of samples needed to estimate each 
entry of the m th moment tensor of F to within additive error e and M be an upper bound on the 
m th moment along any direction. 

Theorem 1 (Factoring, Gaussian noise). Let F = FyFyy be a distribution over W 1 where V is 
a subspace of dimension k, and Fw = T n ~ k . Suppose that Fy is (m,rj) -moment- distinguishable 
for each unit vector u £ V. Then for any e, 5 > 0, in time C F (n, m, e)poly(n, rj, 1/e, log(l/<5), M), 
Algorithm Factor Under Gaussian finds a subspace U of dimension at most k such that for j < m, 
dj{F,FuFjj±_) < j(M + 7j)e with probability at least 1—5. In addition, for any vector in u G U, 
\\nv(u)\\ > 1 — e. 

Next we turn to learning. For a distribution F and a fc-dimensional concept class H, we say 
that the triple (k,F,T~L) is (m,r\)-moment-learnable if: 

1. F = FyFw is a factorizable distribution with dim(V) = k. 
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2. Ti is a set of fc-subspace juntas whose relevant subspaces are contained in V . 

3. For I £ H with minimal (with respect to dimension) relevant subspace P C V, for each unit 
vector u 6 P, either or (the distribution over the positive samples) is (m, jj)-moment 
distinguishable along u. 

In words, the third condition says that if Fy resembles a Gaussian in its first m moments along 
every direction, then Fy does not. We will see examples of concept classes and distributions for 
which m is bounded under this definition. Indeed, we conjecture that a concept class % with 
bounded VC dimension d is (m, rf) moment-learnable where m depends only on d and r\. 

To state our learning guarantee, we need one more definition: A triple (k,F,H) is called robust 
if for any subspace U of dimension at most k with orthonormal basis {ui} where |u?Vy(uj) | > 1 — e, 
then £(ttjj(x)) labels correctly 1 — g{e) fraction of M" under F where g(e) < e c for constant c > 
and sufficiently small e. The definition requires the distribution F and labeling function £ to be 
robust under small perturbations of the relevant subspace. Once we identify the relevant subspace 
approximately, we can project samples to it and use an algorithm that can learn I in spite of a g(e) 
fraction of noisy labels. 

Theorem 2 (Learning, Gaussian noise). Let e,5 > 0, let £ £ T~L where {k,F,7-L) is (m, ^-moment- 
learnable and robust, and let F\y = Y n ~ k be Gaussian. Suppose that we are given labeled examples 
from F, then Algorithm Learnllnder Gaussian identifies a subspace U and a hypothesis h such 
that h correctly classifies 1 — e of F according to £ with probability at least 1—5. The time and sample 
complexity of the algorithm are bounded by T(k, e) + Cp(n, m, e)poly(n, r/, k, 1/e, log(l/<5), M) where 
T is the complexity of learning the k-dimensional concept class %. 

We note here that for a concept class of VC-dimension d, a standard reduction implies that 
the complexity of learning with e arbitrary noise is at most (2/e)° (dlog(1/e)) times the complexity 
of learning with no noise (Proposition [9]). Our algorithms run in polynomial-time in n provided 
(k, F, T~L) satisfy the moment-learnable condition. Some special cases of this result were previously 
known, e.g., when F is a Gaussian and H is a convex concept class [271 H2]- The application of 
PCA to learning convex bodies in [42] can be viewed as the assertion that convex concepts in M fc 
are moment-learnable: under a Gaussian distribution, the positive distribution F + has variance 
less than 1 along any direction. The following two examples further illustrate Theorem [2j 

• When the full distribution in the relevant subspace is uniform in an ellipsoid, then robust 
concept classes can be learned in time T(k, e) + C& j£ -n 2 . Here T depends on the k and concept 
class, and C is a constant fixed by k and e and independent of the concept class. Thus we can 
learn general concept classes beyond convex bodies and low-degree polynomials for uniform 
distributions over a ball in the relevant subspace. 

• When the distribution on the positive examples F + has bounded support, i.e., the positive 
labels lie in a ball of radius r(k), such robust concepts can be learned in time T(k, e) + C^^ ■ 
n O(r(k)' 2 ) £ or an arD rt rar y distribution in the relevant subspace. Previously, for logconcave 
F, learning an intersection of k half-spaces was known to have complexity growing as 

03 [26]. 

1.3 Techniques 

Our strategy is to identify the relevant subspace V is to examine higher moments of the distribution. 
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As mentioned earlier, our approach is inspired by viewing PCA as finding the global maxima 
and minima of the bilinear form defined by the covariance matrix. Instead of trying to compute 
global optima of the multilinear form, we use local optima. These local optima turn out to be 
highly structured. The use of local optima can be viewed as an effective realization of higher-order 
PCA that leads to efficient algorithms. Previous algorithmic problems have all required the use of 
global optima — for example, the planted clique algorithms of [201 HI] • We prove that local optima 
of the m th moment f m (u) = E ((x T it) m ) must lie entirely in V or its complement W (Lemma H]) 
unless its first m moments are identical to those of a Gaussian. 

To make these ideas algorithmic, we use a local search method that increases the function value 
by performing first-order moves along the gradient and then second-order moves in the direction of 
the top eigenvector of the Hessian matrix. These second-order moves allow us to avoid saddle points 
and other critical points which arise in higher dimensions. Saddle points have a gradient of zero and 
look like maxima in some directions and minima in others. While searching for a local maximum, 
one could end up in a saddle point. The top eigenvector of the Hessian shows directions of greatest 
quadratic increase, and hence will move us from the saddle point to a true local maximum. 

Another component in our algorithms is an approximate version of the well-known Schwartz- 
Zippel polynomial identity test. Observing that f m (u) is a polynomial of degree m in the variables 
til, ... , u m , in principle we can test ( whether f m is a constant function by evaluating f m at random 
points. We use a robust version of this test (Lemma I12p derived via a result of Carbery and Wright 
Pi- 

2 Structure of local optima 

We derive a representation of f m (u) = E ((a;*u) m ) in Lemma El Using this representation, we show 
in Lemma H] that each local optimum lies in V or W exclusively. Finding a sequence of orthogonal 
local optima will give us basis vectors for the relevant subspace. 

For convenience we often use uy = -Ky(u) for the projection of u onto V, uw for the projection 
onto the orthogonal subspace W, and vP for the unit vector in the direction of u. 

We may assume that E (x) = 0: if otherwise, then we can apply a translation x — E (x). 

Lemma 2 (Translation of product distributions). Let x G IR n be a random vector drawn from 
F = FyFw, a product distribution. Then x — E (x) has a product distribution over V and W . 

Proof of Lemma\^ Take our translation y = T a {x) = x + a, for Borel sets B\ and B 2 : 

Pr (y v G Bi A y w E B 2 ) = Pr (x v + a v G B 1 A x w + a w G B 2 ) 

= Pr (xy G B\ — ay A xy/ G B 2 — aw) 
= Pr {xy G Bi — ay) Pr (xw G B 2 — aw) 
= Pr (y v G B^Priyw E B 2 ) 

□ 

e can combine this with a linear transformation to obtain an isotropic distribution, given by 
y = Ti~ l / 2 (x — fj) where \x is the expectation vector. This simplifies subsequent calculations because 
the covariance matrix for y is I n . The following lemma, inspired by [191 fl^l 157] , provides the main 
insight for the structural theorem. 
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Lemma 3 (Representation of f m ). Let F = FyF\y. Suppose that x has the same j th moments as 
a Gaussian for all integers j < m, then for u € S" -1 : 

fm{u) = \\u v \\ m (E {(x v u v ) m ) - 7m ) + \\u w \\ m (E ((44) m ) - 7 m) + 7m (1) 

Proof of Lemma\^ Consider the case when m is odd: 

f m {u)=E{{x T u) m ) 

= E (^((x v + x w ) T (u v + uw))j 

m—l / 

m 



m-l / \ 

E {{xluvD + E ((a^u^D + [ m )E ((^v)') E ((^^) m ^) 



i=l 



The last line follows by applying the independence of random variables which depend only on the 
V and W subspaces. Each term in the last sum contains an odd moment of a Gaussian, hence: 

fm{u) = \\uy\\ m E({x v u v ) m ) + \\u w \\ m E((xlu° w r) . (2) 

When m is even, we need the following formula: 



El 171 \ II II* II \\m-i 
I • J \\ U V\\ \\UW\\ li1m-i=1m 



1=0 

This follows from E ((aX + bY) m ) = 7 m where a 2 + 6 2 = 1 and X and Y are independent standard 
normal variables: 

/m(«) = E ) iw* iiuwrii^E ((^)*) e ((x^r-) 

m-l / s 

II lim (i T \m\ , n lim m /V T \m\ , I I II II* II lim— * 

= \\uv\\ E((x Uy) ) + \\u W \\ E[(X Uy) ) + \ ■ ) \\ U V\\ \\Uw\\ lilm-i 

i=l ^ 1 ' 

= \\uy\r (e ((* t <d - 7 m) + n^ir (e ((x T u v r) - 7 m) + £ (jj 



u vf \\uw\\ m 1 lilm- 



i=0 

= \\u v \r (e ((x T o m ) - 7 m) + iiwir (e {(x t u v d - 7m ) + 7m 

□ 

Using this representation, we can characterize all local optima of f m . 

Lemma 4 (Support). Let distribution F = FyF\y have the same first m—l moments as a Gaussian 
but a different m th moment. Then for a local maximum (local minimum ) u* of f m restricted to the 
unit sphere, where f m (u*) > 7 m (f m (u*) < J m ), either \\u v \\ = 1 or \\v,yy\\ = 1. 
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Proof of Lemma\4\ Consider the curve C = {s(u* v )° + t(u* w )° : s 2 + t 2 = 1, s > 0, t > 0}. The 
point u* lies on C: thus if u* is a local maximum in full space, it had better be a local maximum 
on C. On the other hand, we will show that there are no local maxima interior to C, whence we 
must have \\u v \\ = 1 or \\uyy\\ = 1. 

Let us denote a v = (E ((x T n^) m ) — 7 m ) and a w = (E ((x T n^) m ) — 7 m ). By the assumption 
that f m (u*) > 7 m , we know that least one of a v or a w is positive. Suppose that s ^ and s / 1: 
we form the associated Lagrangian with positive real multiplier A: 

C = a v s m + a w t m + 7m - A(s 2 + t 2 - 1) 

At every critical point in the interior of C, we must have DC = 0: 



ma v s m — 2Xs 
ma m t m ~ l - 2Xt 







If we consider only the interior critical points where s,t > 0, then both a v > and a w > 
(otherwise we would have A > and A < 0). There is only one solution: 

l/(m-2) V(m-2) 

A = (m/2)(a„s m + a u ,t m ) 



2/(m-2) 2/(m-2) / 2/(m-2) 2/(m-2) 

If we now consider the Hessian on the tangent plane orthogonal to the gradient of the constraint 
(equivalent to considering the bordered Hessian) , we see that it is positive definite for m > 3 (when 
m = 3, differentiating a v s 3 + a w {\ — s 2 ) 1,5 twice at the critical point gives a positive value): 



D -.£ _ j m(m - l)a v s m 2 - 2A \_ m(m - 3)a v a w / ( . 

m(m - l)a w t m - 2 - 2A J ~ r 2 /(m-2) ,2 



2/(m-2) 2/(m-2) 



(m-2)/2 " 



In particular, there are no local maxima interior to C, that is, \\v,y\\ = 1 or ||tty-|| =0. □ 



3 Finding a basis 

Our two basic algorithms exploit the property that local optimum to f m (u) = E ((x T u) m ) on the 
unit sphere must lie in either V or W (Lemma [4]). In this section, we assume that the algorithms 
have access to exact moment tensors and can compute exact local optima. We provide efficient 
algorithms (with error analysis) in Section^ 

The basic idea of the algorithm is to start with a random direction and evaluate the j'th moment 
in that direction. If it is different from a Gaussian we go to a local max or local min (whichever 
keeps it different from a Gaussian) and thus find a vector of interest; if many random unit vectors 
have Gaussian moments, then all directions have Gaussian moments due to the Schwartz-Zippel 
Lemma and we go to the next higher moment. At the end of the algorithm we have a subset of an 
orthogonal basis consistent with V and W, and the property that all orthogonal directions have 
Gaussian moments. 

Lemma 5 (Schwartz-Zippel[36j). Let P £ F[x%, . . . ,x n ] be a nonzero polynomial of degree dn > 
over field F. Let S be a finite subset of F and let r\, . . . ,r n be selected randomly from S. Then: 

Pr(P(r 1 ,...,r„) = 0) < r^. 
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Algorithm 1 FindBasis 



Input: Moment bound m, Distribution F 
1: Orthonormal vectors B <— (j), moment tensor j <— 2. 
2: while \B\ < n and j < m do 

3: Compute the j th moment tensor M- orthogonal to B, so that for any v G -B -1- , = 

E((z T ^y) =Mf («,...,«). 
4: if /,•(*;/ \\v\\) = jj then 
5: j<-j + l 

6: else 

7: if fj{v) > jj for some v then 

8: Compute a local maximum u* to /j starting from v. 

9: else 

10: Compute a local minimum u* to /j starting from v. 

11: B^fiU{4 
12: return 5 



For Line 3, let ^4 : M n — > M n denote the linear map that projects orthogonal to B. Then 
M{Au,...,Au)= M lu ..., im (Au) h ■ ■ ■ (Au) lm = J2 \ J2 M h,~,i m A h,h ' ' ' A 



l m ,3m J U jl 

The identity check in Line 4 is performed by selecting a random vector x with i.i.d. uniform 
coordinates from {1, . . . , 2m} and evaluating the polynomial fj(x/\\x\\) — ^/j. Repeating 0(log(n/S)) 
times gives a 1 — 8 probability of success. 

Algorithm FindBasis does not suffice on its own. Although every direction orthogonal to B 
looks Gaussian up to the m'th moment, it is possible that some directions are correlated with 
vectors in B. The next procedure identifies such directions. 

Theorem 3 (Find Basis). Let F = FyFy/ be a factorizable distribution over M n . Then, with 
probability at least 1 — 5, each vector in the output of FindBasis lies in either V or W. 

Proof of Theorem^ From the above comment, at each step, with probability at least 1 — S/n 
(hence total failure probability 5), we are able to find a point u where f r (u) ^ 7m- hi particular, 
if f r (u) > 7 m , then we find a local maximum u* . By Lemma HI u* is contained entirely within V 
or W. The analysis is identical when our initial point u satisfies f r (u) < j m . 

Observe that i 7 V\span(B)-^W'\span(B) i s a factorizable distribution over tt b ±(x), and hence a local 
optimum in B 1 - also will lie in either V or W. □ 

The next theorem states that ExtendBasis finds all vectors which are correlated with SC^, 
and that all remaining vectors at the end of the algorithm are uncorrelated up to the m th moment. 

Theorem 4 (Basis Extension). The output S' of ExtendBasis on input S QV,T satisfies: 

1. scs'cy. 

2. For {v t } C S' and {uj C (S') 1 : 

e f n>v) ri(x T ^)) = e ( fiix^uA e ( n^)) 

\i=l t=l J \i=l / \t=l / 
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Algorithm 2 ExtendBasis 



Input: Moment bound m, distribution F, candidate vectors S and non-Gaussian directions T. 
1: S' ^S,j^ 2. 

while | S' | < k and j < m do 

for each choice (with repetition) {v±, . . . ,vi} C S' where 1 < I < j. do 



Compute the (j — I) tensor M t •' so that for any u G (<S n T)- 



if g(u) = then 

Continue with next choice of {v{\. 
else 

if g(u) > for any u then 

Compute a local maximum u* to 5 starting with u/ \\u\\ 
else 

Compute a local minimum u* to g starting with u/ \\u\\ 
S' ^— S' U {u*} and restart the while loop with j = 3. 

return 5'. 



Proof of Theorem [^J The Schwartz-Zippel lemma returns a correct decision at every iteration (there 
are at most n k of these, so if we pick our domain to be of size 2n fc and run 0(logn fc /S) iterations 
each time, then we have a correct decision for all iterations with probability at least 1 — 5. 

Let u* be a local maximum found by ExtendBasis using the j th moment. Consider the 
{v\, . . . , v{\ where u* was found. 

g(u) = E ^(a^u)^ L[( xT ^ " E ((z Tu ) W_0 ) E ^Q(z T ^ 



E 

i=0 



t=l 



)E((x T W) 

Since u* was found at moment j, then for all < i < j — I: 
E ( (x T u v )ti- 1 -^ 



E (x T u v )«- l -V H(x T v t ) - E H^ T ^) E ((* T «v) 



vt=l 



x uy 



,tf-»-0 



Only the first and last terms survive: 
/ J 



</(«) = e (x t u V )v-v n(x T v t ) - e no^) e (o&v) 



t=l 



vt=l 



\Cj-0 



vt=l 



Having a positive local maximum implies that [|ity|| = 1. 
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For the second part of this lemma: we already know that all the remaining vectors must have 
Gaussian moments. Fix j < m and a choice of {vi,...vi} from S' and consider the symmetric 

tensor M represented by f(u) — E ((a; T w) J ' — ') E ^nt=i( a;Tu *)) • We require the following claim for 

symmetric tensors where for any permutation a : [m] — > [m\: 



Claim 6. If A is a symmetric order r tensor, then: 



Using Claim [6j 



max A(v, . . . , v) = max A(v\, . . . , v r ) 

\\v\\ = l ||ui|| = l,...,||iv|| = l 



max M(u, . . . , u) = max M(ui, . . . , 



ll«ll=l ||u 1 ||=l,...,||u,_ J ||=l 



I I ' 3 ~ 

j- ' 



In particular, there exists {ui} such that 

'i-i i \ /i-i 



1 f Ui^ut) fl(x T ^)) > E fn^)) E ( Yl^vt)) 

\i=l t=l J \i=l / \t=l J 



if and only if there exists u such that 



e Ux T uy- l l^x T v t )\ > e [(x T u y- 1 ) E fll^^)! 

But at the end of the algorithm, we know that there are no such u, hence there can be no such m 
either. Thus, we can factor any u ^ S' through the expectations which contain only v% from S' . □ 



3.1 Generalized ICA 

We can now give an algorithm for generalized ICA, assuming access to exact moment tensors and 
local optima. If F is factorizable, Algorithm Factor will provide a factoring into subspaces such 
that the marginal distributions look independent up to m moments. The output of FindBasis is a 
set of vectors that each lie in V or in W. We try all possibilities for the subset from V, then extend 
this using ExtendBasis, consider the resulting decomposition and take the option that gives a 
product factorization. The factorization found will be U, U 1 - for some U C V. 

Theorem 5 (Factoring, general noise). For any e, S > 0, given the moment tensors of distribution F 
over M. n and the ability to compute exact local optima, if there exists a subspace V with dim{V) = k 
such that for j = l,...,m dj(F, FyFw) = 0, Algorithm Factor finds a subspace U such that 
d(F, Fi/Fjj±) = with probability at least 1 — 5. The time and sample complexity of the algorithm 
are bounded by j7,°( fc + m ) . 

Proof of Theorem^ In the enumeration of all subsets of size at most k subsets of B at line 2, we 
encounter T = B Pi V. By Theorem HI the output S' of ExtendBasis contains only vectors in V 
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Algorithm 3 Factor 

Input: Highest moment m, distribution F. 
1: B <s— FindBasis(m, F). 

2: for every subset T C B of at most k vectors do 
3: S' <- ExtendBasis (m, F,T, B — T). 
4: T' «- ExtendBasis(m, F, B - T, S'). 
5: if 1 5" | >fcor |T'| >n-k then 
6: Continue with the next choice of T. 

7: Augment S' with fc — \S'\ orthonormal vectors from M. n — span (T"), forming basis J7. 
8: Compute m moments of F, Fjj and F[/±: 
9: for Kmdo 
10: Compute: 




where {x pi , . . . ,x Pk } correspond to coordinates in U, and {x Pk+1 , ... ,x pi } correspond to 
coordinates in the U 1 - subspace. 
11: return U with lowest Afj, breaking ties by considering Afj, A^, . . . 



and T' contains only vectors from W. By Part 2 of Theorem 01 the following holds for any choice 
of vector {ui} C S n_1 — span (S', T'), we have E ((a^-Uj) 3 ) = 7 m for j < m and that: 

E ( fi(x T Ul ) Yl^vA = E (ll^uA E (il(x T v t )) 

\i=i t=i J \t=i / \t=i J 

for vt £ S' U T'. In particular, every such u is independent from S' and T' up to the m th moment. 
In the augmented basis, the expectations separate into the products over the two subspaces: 

E {xi x . . . — E [x pi . . . %pj^) E (x P j +1 . . . x pi *j 

where {x pi , . . . ,x p -} correspond to coordinates in the U subspace, and {x Pj+1 , ... ,x Pl } correspond 
to coordinates in the U ± subspace. In particular, the entries of the moment tensor of F are equal 
to the entries of the moment tensor of FuFu±_ , and hence will return A J = 0. □ 

4 Gaussian noise model 

We now give a complete algorithm assuming Fy/ is a Gaussian, assuming we only have access to F 
through samples (not exact moment tensors) . The main difficulty is handling the error accumulation 
over multiple iterations, as in each round we can only hope to approximately compute moments 
and find approximate local optima. The idea is that FindBasis and ExtendBasis find vectors 
where E ((x T u) m ) / j m . If F\y is Gaussian, our algorithms only find directions in V. Thus, the 
error accumulates over only k steps, and the total error depends on k rather than n. 
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4.1 Local search 



To compute approximate local optima, we perform gradient ascent, moving in the direction of the 
gradient. If moving along the gradient does not increase the function value by a certain value, 
we switch to second-order moves based on the Hessian. We will use the notation that Dg u is the 
gradient of g at u and D 2 g u for the Hessian matrix. The top eigenvalue of a matrix on a subspace 
orthogonal to a vector can be computed via a coordinate transformation. We denote M = ||M m || 2 . 
LocalOpt terminates in polynomial time when the parameters e±, ri, e 2 and r 2 , the thresholds and 

Algorithm 4 LocalOpt 
Input: Function g, error parameter e\, 
1: u <— uniformly at random over unit sphere. 

2: while \(u,Dg u )\ < (1 - ei) \\Dg u \\ or \ max (D 2 g u ) > e 2 on u 1 do 

3: if \(u,Dg u )\ < (1-ei) \\Dg u \\ then 

4: Direction v ir u ± (Dg u ). 

5: u <— u + r\vj \v\ . 

6: Renormalize u <— uj ||u||. 

7: else if X m ax{D 2 g u ) > €2 on ti -1 then 

8: Direction v <— top eigenvector of D 2 g u on . 

9: U ^ U — T2vj ||«|| . 

10: Renormalize it n/ ||it||. 
11: return u. 



step sizes for the first-order moves and second-order moves are chosen appropriately. Note that e 2 
varies with the function value, but the remaining parameters are fixed. 

Lemma 7 (Local search termination). Let g(u) satisfy g(tu) = t m g(u) for some integer m. Suppose 
that for our starting point u that g(u) > r\ > 0. Choose the parameters as follows: 

f 8lm(m — l) 2 ri 2 \ 2 Je{ 
£ 1 < T7T77777 »"1 < 



1048M j ~ 4m 2 M 

3m(m — l)g(u) 9r] 
£2 = : r 2 < 



256(m - 2)M 

where M is an upper bound for g on the unit sphere. Then LocalOpt will terminate in at most 
poly(M, m, 1/ei, 1/ri, l/r2, 1/77) iterations. 

Proof of Lemma [?| Consider an iteration of the algorithm where the first derivative condition is 
unsatisfied, and we make a step of size r\ in the direction of v/ ||u|| (call the step h). The new 
function value at this point u + h is given by the Taylor series expansion with error (where £ lies 
between u and u + h): 

g(u + h)= g{u) +Dg u -h + \h T {D 2 g c )h 
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The increase in function value is lower bounded as follows: 

Dg u ■ h + h T D 2 g c h > n (^Dg u , ^ - ±r 2 (v/ \\v\\) T D 2 g c (v / \\v\\) 

> n^-Klm 2 M 

i— 1 i— 

> ri-s/ei - g r iV £ i 

7 ^_ 

Thus, we have lower bounded the increase in the function value. When we rescale u + h back to 
norm 1, we can apply the m-homogeneity of / to deduce that: 

u + h \ 1 / , 7 \ 

-g{u + h) 



\u + H) \\u + h\\ m/2 ' 
We can compute \\u + h\\ = 1 + r 2 because r\ is perpendicular to u. Hence: 

u + h \ 1 f ,u\ 

g(u + h) 



u + h\\J (l + rf) m /2 



>(l-^rn g(u + h) 



where we used the estimate: 



{l + x) k >l + (k + \/2)x 
for x < 2/k 2 . To finish this calculation, we simply substitute our value for T\ in terms of e\: 

9 i^rki) - 9{u) i 1 + ^b ri ^) i 1 " m ri ^) 
* 9{u) i 1 + i ri ^) 

Hence, there are at most at most a polynomial number of iterations of this form. Consider now 
an iteration where the second derivative condition is unsatisfied (and the first derivative condition 
must be satisfied) . We take the same Taylor series expansion with error term (to one further term) , 
where h = r^vj ||w||: 

g(u + h)= g(u) +Dg u -h + h T D 2 g u h + ^D 3 g c (h, h, h) 

We will show that the contributions from the first and third derivative terms are small, and that 
the second derivative term dominates. In the first derivative term, note that h is orthogonal to u, 
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and hence the component of Dg u parallel to h has norm at most \/2ei — e 2 ||L>g u ||. We estimate 
the other terms as before: 

„ 1 t 9 1 q x / o 1 9 m(m— l)(m — 2) o 

• /i + ^ D 9uh + - J D 3 9c(/i, /i, /i) > " V2ei - e 2 mM + -r 2 e 2 - f -rf M 

31 2 
^ ol^ 2 

Once again, we have to rescale back to norm 1. In this case: 

u + h \ ( 31 2 m + 1 2 

^Th\\ ) " I 1 + 64^) ^ " ^ r2 

/ 93 2 m + 1 2 

> q(u) 1 H m(m — l)r 9 r 9 

~ yv ; V 256 V 12 2 2 

/ x ( 93 2 

^ 9{U) + 256 r2 

The last bound follows because the worst possible lower bound occurs at m = 3. Hence, there are 
only a polynomial number of iterations of this form as well. □ 

4.2 Exact moments, approximate local optima 

We are now ready to extend the analysis of Theorem [3] to the case when we have access to the 
exact moment tensor, but instead of using exact moments, we will use LocalOpt with appropriately 
chosen e\. On the other hand, using a weaker local optimum algorithm will also give us weaker 
guarantees on the quality of the output, giving a weaker form of Lemma HI Over W 1 (instead of 
S n_1 ), Lemma [3] gives us a formula of the form: 

fmiu) = \\u v \r (e {(xiu° v r) - 7m ) + \\u w \\ m (e {(xiu° w r) - 7m ) + lm nr 

We will optimise the function g(u) = f m (u) —j m \\ u \\ m using LocalOpt over the unit sphere. This 
is equivalent to optimising f m and simplifies our derivative calculations. 

Lemma 8 (Exact moments, inexact optima). Let F = FyF\\r have the same first m — 1 moments 
as a Gaussian but a different m th moment. Suppose we run LocalOpt on g(u), starting from a 
point u where g(u) > rj, setting e± < mrj 2 ^ m ~~ 2 ' > /M 2 l(' m ~ 2 \ After poly(rj, l/e\,rf) iterations, we will 
have a point u* where either ||7iy(u*)|| > 1 — 16ei or [|-7rw(u*)|| > 1 — 16ei. 

Proof. We proceed as in LemmaHl u* lies on a curve C = {s(uy)° + t(u^ v )° : s 2 + t 2 = 1, s > 0, t > 
0}. We will show that neither s nor t is bounded away from and 1. 

Restricted to the curve g(u) = g(s, t) = a v s m + a w t m . Suppose that a w < 0, then we must have 
that s > {n/M) l / m . In this direct calculation comparing (Dg u ,u) = ma v s m 1 + ma w t m 1 

with ||-D<7u|| = m-\/a 2 s 2 ( m_1 ) + a^t 2 ^" 1-1 ) will yield s > 1 — 2e\. Thus, we may assume that both 
a v and a w are positive, and that a v > a w . 

Suppose that for a unit vector u, we have s, t > 16^/ei, and the first-order gradient condition: 

(Dg u ,u) 
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then, 



3m(m — l)g(u) 



(where the eigenvalue is taken only in the subspace orthogonal to it). Thus, the algorithm continues 
making progress at such a vector u. To do this, we lower bound the top eigenvalue by the quadratic 
form in the direction —tu v + su?^, which is orthogonal to 



u. 



,(D 2 g u ) > (-tu v + su° w ) T D 2 g u (-tu v + su° w ) 

= m(m - \){a v s m - 2 t 2 + a w s 2 t m - 2 ) 

' t 2 /s 
s 2 /t 



m{m - l){a v s m - 1 l( i w t m - 1 ) 



i.m— 1\ 



By construction, u has two nonzero coordinates, taking values s and t and all other coordinates 
zero. Dg u has partial derivatives ma v s m ~ 1 and ma w t m ~ l in these directions. Thus, 



(Dg u ,u) 
\\Dg u \\ 



< 



a v s 



m—1 
m—1 



a v s m ~ l 

Qifi) t 



Thus we obtain the condition that: 

m— 1 



a v s ' 

i 

CLqnb 



m—1 



(1 



a v s 

dr, nt 



m—1 
m—1 



+ V2e 



a v s 



m—1 
m—1 



where < e < e± and r is a unit vector orthogonal to (s,i). Substituting this into the previous 
equation: 



A m ax(-D 2 sO > m(m - 1) 



> m(m — 1) 



(1 



a v s 

CLwt 



m—1 
m—1 



+ 



\/2e 



a v s 

dint 



m—1 



m—1 



> m(m — 1) 



a v s 

hu 

a v s 

hu 



m—1 



dint 



m—1 



(1 
(1 



0- V2e 



2\/2t 



1 1 

S + l 
1 



16Ve 



> 



3m(m — l)g(u) 



t 2 /s 
s 2 /t 



where the last estimate follows from the Cauchy- Schwartz inequality. 



□ 



4.3 Approximate moments and approximate local optima 

By using the robust Schwartz Zippel Lemma (Lemma ll2p instead of the usual form, and LocalOpt 
at Lines 10 and 11 of FindBasis and Lines 13 and 15 of ExtendBasis, we can obtain an efficient 
randomized algorithm. The major difficulty remaining is that we must bound the error incurred 
every time we call LocalOpt. The error analysis is technical: the idea is to obtain approximate 
versions of Lemmas [3] and HI and to show that LocalOpt behaves well on these approximate 
versions. Consider the first iteration: 
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Lemma 9 (Two steps). Let x have the same first m — 1 moments as a Gaussian but a different 
m th moment. Let u% = y/Y--5vi — V5wi be the vector found in the first iteration of FindBasis, 
where v\ and w\ are unit vectors in V and W respectively. Suppose we run LocalOpt on g{u) = 
fm(u) — 1m \\u\\ m on the orthogonal subspace u\ , starting from a point u where g{u) >n = M5 1 / 16 , 
setting e\ < n ^J/(m-2) ~ 60m 2 M 2 <5 5//16 as the error parameter in LocalOpt. After poly(n, l/ei,rj) 
iterations, we will have a point u* where either ||7iy(u*)|| > 1 — S 1 ^ 8 or ||7rw(tt*)|| > 1 — (5 1 / 8 

Tthe sequence of ideas in this proof is not unlike the proofs in Section [3) first we derive a 
nice representation of f m (cf Lemma [3j then we analyse the support of a local optimum under this 
representation (cf Lemma H]) - we are not able to claim that the local optimum found is contained 
wholly in V or W, but since we are satisfied with approximate local optima, we can bound the 
components around and 1. All through this, we must bound the error, and try to push through 
the calculations of Lemma [HJ 

Proof of Lemma\Q First, we will construct an orthonormal basis which includes u\: extend {v\} 
and {wi} to orthonormal bases {vi} and {w;} of V and W respectively. Replace v% and w\ with 
the following two vectors: 

u\ = vT — Svi — ySwi 
u\ = vSvi + Vl — Swi 



Thus our basis will be {u\,ui,V2, • • • , Vk,W2, ■ ■ ■ , w{\. For a vector x = (x\, . . . , x n ) in the basis of 
{v^ and {u>i}, we now have: 

x = (Vl — 5x\ - VSx 2 , Vdxi + Vl — 5X2, xs, • • • , x n ) 

which is simply a rotation (unitary transformation) in the first two coordinates. 

We evaluate the m th moment on the subspace orthogonal to u\. Let £ be a point on this 
orthogonal subspace: note that £ has component in the first coordinate: 

/ m (e) = E((x T £D 



E 



We can break up the argument into two dot products, which are independent of each other. More- 
over, observe that the norm of the two constituent parts of the £ vector taken together is still 
1. 

X £, = (%1 j X V2 , • • • , X Vf _ ) ( V^5^2 1 £-02 ' ' ' ' ' 6ufc ) i.X2i X W2 , . . . , X Wl ) (\/l 0~£,2i £,W2 j • • • > £to; ) 

Hence, we can apply Lemma [3) this gives a perturbed version of Lemma [3l 

/ h \ m / 2 

f m (o = Ue 2 + J2 E (((^i.^. • • • > z«j t (v^2, • • • , ufr - 7m) + 
/ 1 \ m/2 

f (1 ~ S )Ci +^2£Li J E (\[{X2,X W21 .. . ,X Wl f(\/l - 5&,€w 2 ,-- ■ ,6^)°) -7m)+7m 
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Fixing a point £* G D S ra_1 : we will give a curve C which passes through this point and 
remains on the unit sphere. We will analyse the value of the g(£) = f m (0 ~ 7m ||£|| m on this curve 
- as before, every point which is a local optimum on S n_1 has to be a local optimum on C as well. 
Thus by studying the local optima over C, we will be able to describe the strucure of the local 
optima in full space. 

We may assume that all the £* are nonnegative - otherwise we can pick simply negate the 
associated basis vector. We take the following as the components for C: 

C= , = =(o,o,C 2 ,...,C fc ,o,...,o) 
T- =2 (^) 2 

1 



Q = (1,0,..., 0) 

Since these are the only three directions that change along C, we will use these three vectors as an 
orthonormal basis. Now, defining the following quantity: 



« = fe*) 2 / ((i - SM) 2 + E(^) 2 J 



we can write our curve C as: 

C = {yC + <, + V^SzQ ■■ y 2 + (1 + Sa)z 2 = 1, y, z > 0} 

Specifically, we will use the basis £* and (1 + a<5) -1 (£^ + £*). Note that in this basis, y is precisely 
the coordinate along the first basis vector and (1 + 5a) l l 2 z is the coordinate along the second 
basis vector. Denote this latter quantity by z' , then by the chain rule, we have that d/dz' = 
{l + 5a)- l / 2 d/dz. 

Restricted to C, the expectation terms simplify: note that — <5£2,£w> 2 > • • • >6«i) remains 
constant on C, so the second expectation term reduces to a constant, which we will denote with 
a w . The first expectation term does not remain constant, because there is an additional component 
in the direction of v±, but this component always has a small magnitude. With a change of basis, 
we can simplify this expression to involving only y and z: 

E (((xi, x V2 , . . . , x Vk ) T (V5&,Z V2 , . . . , £ Vh ) ) m ) = E ( ((xi , s e ) T ( V^6z, y)) ") 

We will denote the first expectation term by a v . In full, our objective function on C is given by: 

g(£) = [fe 2 +y 2 ]( m / 2 )E(((x 1 ,x,) T (v / o^z,y) ) m -7 m ) + a w z m 
= [6az 2 + y 2 ]^a v (y,z) + a w z m 

Next we will examine the local optima on C: let £ be the output of of LocalOpt: we will show 
that £ has large projection with either the V or W subspace (cf Lemma H]). We will analyse the 
following cases: 

1. y 2 < 5 1 / 4 or z 2 < 5^. 
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2. y 2 > and z 2 > 1/3. 

3. z 2 > 5 1 ' 4 and y 2 > 1/3. 

Case 1: Suppose that y 2 < 5 1 / 4 , then we must have z > yl — 5 1 / 4 — ad. The approximate local 
optimum u that we compute has projection at least \/l — 5 on this local optimum, and hence, the 
projection of u onto w is at least: 

\\irw(u)\\ > (1 - 5){1 - - a5) - V5 

> 1-5/2- <5 1/4 /2 -aS-VS 

> 1 - 5 l ' A 



In this case, for sufficiently small 5, we have: 

hw{u)\\ 2 >l-5 l/s 

The argument for when z 2 < J 1 / 4 is identical. 

Case 2: We will prove that LocalOpt can not terminate in this region by carrying out the 
calculations of Lemma [5] whilst keeping track of errors. Thus, let £ be a point in this range, we 
will show that if the first derivative condition in LocalOpt is satisfied, then the second derivative 
condition is unsatisfied, thus LocalOpt can not terminate at £. First, let us examine how / changes 
over C: 

Claim 10 (First partials under perturbations). In the range where y 2 > 5 1 / 4 and z 2 > 1/3, 

< 3mMVd 



ma v y 

dy 

dg_ 

dz 



m—l 



ma m z 



m—l 



< AmM\f5 

As a corollary, via the triangle inequality, we have that: 



\\{gy,g z )\\>m\\{a v y m -\a w z m - 1 )\\-bmM^5 
Claim 11 (Second partials under perturbations). In the range where y 2 > J 1 / 4 and z 2 > 1/3: 



9 2 9 , m-2 

^-^ - m(m - l)a v y 
dz 2 



m(m — l)a„,z 



m-2 



d 2 g 



dydz 



< c vv m 2 MV5 

< c ww m 2 M\[5 

< c vw m 2 MV5 



where c vv , c vw and c ww are absolute constants bounded by 20. 

Throughout the rest of this calculation, we will use the basis of n — 1 vectors consisting of 
{£*, (1 + a<5) -1 (£^, + £*)}, and any orthonormal extension to . In particular, in this basis, 
£ = (y,z',0,...,0). 
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As before, we will lower bound the contribution of the second derivative term. Our direction of 
movement will be (—z f , y, 0, . . . , 0). This vector is clearly a unit vector orthogonal to £. where top 
eigenvalue is taken orthogonal to £. 



A m axp 2 ^) > {-z\y)D 2 9i 



y 



> (-VlT^5z,y) ( 9vy 9z ' v \ ( 
V 9z'y 9z'z' J V 



-yi + <5az 

y 

We can further use Claim [TT] to simplify the other components of the quadratic form: 

A m ax(-D 2 g%) > (1 + 5a)z 2 g yy + y 2 g z ' z i — 2c vw m 

z 2 /y 



J — (1 + (5a)(c 22 + c ra + 2c zw )m 



> {\ + 8a)m{m-l){a v y m - L ,a w z m -') , 2 , 

\y i z 

Our first derivative condition is given by: 

Since £ = (y, z' , 0, . . . , 0) has only two nonzero components, we need only evaluate two components 
of the derivative: furthermore, we can lower bound the norm ||.Dfl£|| > HGfyjfl'z')!!) w hich gives the 
following lower bound: 



(9y,9z') ( V z , 
\\(9y,9z')\\ 

Rearranging, and applying Claim [TOl yields: 



> 1-ei 



m{a v y m ~\a w z m - 1 ) ( \ \ > (1 - ei) \\(g y ,g z >)\\ - 7mM^5 

> m(l - ei) \\(a v y m - 1 ,a w z m - 1 )\\ - UmMVd 

> m (l _ ei _ l^V?) ||( m-1 m^N 

Thus, W6 Cciri rewrite this relationship for unit vector v orthogonal to {cl v y m 1 ,a w z m 1 ) and < 

< + 12MVI. 
— 1 »7 

) = (l-e)!!^™- 1 ,^™- 1 )!! f * )+V2 e -e 2 ||(a l) y m - 1 ,a tu z m - 1 )||r 

"ID* / V / 

Substituting this back into our lower bound for A max yields: 
A max > (1 + Sa)(l - e) || Ky m - 1 ,a u ,z m - 1 )|| (z 2 + y 2 ) - \/2e - e 2 || (a^ 1 , a^^ 1 ) || Q + ^ 
- SOr^Mv 7 ^ 

> (1 + <5a)(l - 5 1/6 )m(m - l)/(£) - \/25 1/24 - SOr^Mv 7 ^ 
3 

> -m(m - !)/(£) 
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where we used the Cauchy-Schwartz inequality for: 



,m— 1 



i Q"in Zn 



1) > a v y m + a w z m > g(£) - mMVaS 



Case 3: This case follows from the exactly the same analysis as above. It is in fact substantially 
easier, as the denominator terms a5z + y are in fact all bounded by constants now, and hence the 
numerator is small enough in almost all cases above to bound the terms. 

We now provide the proofs for the claims regarding the coefficients a v and a w . In explicitly 
taking derivatives, it is important to note the following: 

a v = E 



({x 1 ,x v ) T (VaSz,y)°^ ) - j m 

E ( ( (xi,x v ) T (Va5z,y) 



1 



(a5z 2 + y 2)(m/2)- 



7rj 



For ease of notation, denote (f> = (VaSz, y), we will suppress all but one 4> argument in our moment 
tensors, thus we will write A{<j>) instead of A((p, ...,(/>), and A((j), e±) instead of A((p, . . . , cf), e±). If 
A is a m th order tensor, its derivative has components given by (DA^i = mA((f), e\) where A takes 
(m — 1) copies of cj>. We also have the Hessian D 2 : (D 2 A ( f ) )ij = m(m — 1)A(4>, ej, ej). We can bound 
the spectral norm of D A using Claim [6j, which yields \rnax 

(D 2 A) < m(m - 1)M. 

Proof of Claim [771 Firstly, we have: 



= my{5az 2 +y 2 )^~ l a v + (5az 2 + y 2 )W 2 )^ 



^ = mz m - l a w + ma5z(5az 2 + y 2 ) {m l 2 ^ l a v + (5az 2 + y 2 )^l 2 )^L 
oz : 1 • 



dz 



The ma5z(5az 2 + y 2 )( m / 2 ) 1 a v is upper bounded in absolute value in mM5. Similarly, it is also 
clear that: 

my{5az 2 + y 2 )^ 12 ^ 1 ^ - ma v y m - 1 < mM^/S 
Thus it remains to show that the partial derivative terms are small: 



(Saz 2 + y 2 )(™/2)^ = {6az 2 + 2 )(m /2) 

oy 



-my 



m 



m- 



(5az 2 + y 2 

-yVa6zA((f), e 2 ) - y 2 A((f>, ei) 
adz 2 + y 2 

-yy/a5zA((f), e 2 ) + a5z 2 A(<f>, e\) 
abz 2 + y 2 



y . m 

)W 2 )+i (0, 0) + (5az 2 + y 2 )W 2 ) 



A(</>, ei ) 



A(</>,ei) 



When we have a term like A((j), . . . ,(f), e±), the arguments are not normalised. In particular: 

A(</>, . . . , 0, ei) = (5az 2 + y 2 )^/ 2 A{cjP , ei) 
Thus, normalising gives: 



(6az z + y 2 )^ 2 ^ 
oy 



da v 



< mM \/8 + m5M 

< 2mMV6 
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For the other partial derivative, we want to compute: 

—mabz 



(8az 2 + j, 2 )W 2 '7 = (5az 2 + y 2 )^ 



, da v 
dz 

= ra\fab 

Applying the same method: 



-A(<p,cP) + 



myab 



(a5z 2 +y2)(m/2)+l VV.YV ( a6z2 +y2 )m/2 

^5zA((j), (fy) + abz 2 A(</>, e 2 ) + y 2 A(<p, e 2 )\ 



A(<t>,e 2 ) 



abz 2 + y 2 



(<W+y 2 ) (m/2) ^ 
oz 



< 3mMVd 



Hence combining this with our earlier bound, we have the desired inequality. 
Proof of Claim By direct calculation, we obtain: 



□ 



d 2 9 



,<9 2 



^§ = (5az 2 + y 2 ) (m/2) ^ + 2my(5az 2 + y 2 )^/2)-i^z + + y 2 )^/ 2 )- 2 ( 8az 2 + ( m _ i)/) 

We now estimate the three terms in this sum - the first two terms will be of order s/S, and the last 
term will give us approximately m(m — l)a v y m ~ 2 . 

(Saz 2 +y 2 )^ d2av 



dy 2 

(baz 2 +y 2 )^{ 



-m 2 y 



{a5z 2 + y 2 ) W 2 )+! 



-yya6zA((f), e 2 ) ^ abz 2 A((f), e\) 



abz 2 + y 2 



abz 2 + y 2 



+ 



m 



{abz 2 + y 2 ) m l 2 



abzA(cf>, e 2 ) 
abz 2 + y 2 



+- 



abz 2 + y 2 



+ 



(m — l)y\/abzA(e 2 , e±) yy/aSzA((j),e 2 ) abz (m — l)A(e\,ei) —2yabzA(cj),ei) 



(abz 2 + y 2 ) 2 



-m 2 y) 



-yVabzA((f>,e 2 ) ^ abz 2 A(4>, e\) 



(abz 2 + y 2 ) 2 (abz 2 + y 2 ) 2 



+ 



+ m 



abz 2 + y 2 

yaSzA((f), e 2 ) 
abz 2 + y 2 



+ 



(abz 2 + y 2 ) 2 

(m — l)y\/ r abzA(e 2 , e\) 
abz 2 + y 2 



y 2 VabzA(4>,e 2 ) abz 2 (m — l)A(ei, e\) —2yabz 2 A(4>,ei) 
(abz 2 + y 2 ) 2 abz 2 + y 2 (abz 2 + y 2 ) 2 

We will bound the magnitude of every term in this sum. Consider the first term of the form: 

-yVabzA(cf), e 2 ) 



(-m y)- 



(abz 2 + y 2 ) 2 



Thus, since m > 3: 



(-m y) 



2 ^-yVabzA(4>,e 2 ) 



(abz 2 + y 2 ) 2 



< m 2 


y 2 VbA((j),e 2 ) 


(a 


bz 2 + y 2 ) 2 


< 3m 2 M 


y 2 Vb 


abz 2 + y 2 



Now, y 2 /(abz 2 + y 2 ) < 1, hence: 



-m 2 y)- 



-yVabzA((f),e 2 ) 
(abz 2 + y 2 ) 2 



< 3m 2 MVb 
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Of the seven terms in the sum, the first, third and fifth terms can be analysed exactly as above, 
and their sum can be upper bounded by \hm 2 M\fb. For the remaining terms we have to use our 
lower bound on y, for example: 



-my) 



a5z 2 A(4>, e±) 



(a5z 2 + y 2 ) 2 



< mM 



< mM 



Sy 



a5z 2 + y 2 
6 

y 



< mM5 7/s 

By this reasoning, we can bound all seven terms by m 2 M\/~5, hence this term in d 2 g/dy 2 contributes 
is bounded in absolute value by 7m 2 M\/S. For the second term in that expression, the analysis is 
almost identical to the previous claim and gives 



2my(5az 2 +y 2 )^- 1 ^- 

dy 



2m l 



< 2m l 



i r 2 , 2^m/2)-i {-yVa5zA(4>°, e 2 ) + aSz 2 A(4>°, ei)) 
y\0OLZ + y ) „ o\q/o 

(oaz z + y l ) A l l 

(-yVaSzA((/) ,e2) + a5z 2 A((j) , ei)) 



< 2m 2 MV6 + 2m 2 M 

< Am 2 MV5 



{5az 2 + y 2 ) 
5_ 

y 



Thus, we have: 



dy 2 



ma v (5az 2 + y 2 )^ 2 ^ 2 {5az 2 + (m - l)y 2 ) 
By applying the triangle inequality: 

ma v (5az 2 + y 2 )( m / 2 )~ 2 (5az 2 + (m - l)y 2 ) - m(m - l)a„2/ m - 2 



< 19m 2 MV5 



< m 2 MVd 



Thus we have the desired result for the second partial with respect to y. The other second derivatives 
are computed in a similar way. □ 

□ 

Using the above, we are now examine what happens after t iterations of FindBasis. The 
following theorem shows that after k iterations of FindBasis, our error blows up at most doubly 
exponentially in k. The proof holds for ExtendBasis is as well. 

Theorem 6 (Multiple iterations). Suppose FindBasis finds j < k orthogonal vectors {u±, ...,Uj} 

of g(u) taking e± such that r\ < Afe^ 16 for each call of LocalOpt, then ||7rv(nj)|| 2 > 1 — e^ 16 ^ . 

Proof of Theorem® After t iterations, we have a basis of orthonormal vectors {u\, . . . ,Ut} where 
each Uj is close to some vector in V: 

m = aii«i + bnwi 

u 2 = a 2 ivi + a 2 2V2 + b 2 iwi + 622^2 



ut = ativi H h a u v t + 6*1^1 H b tt w t 
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We use the orthonormal basis {vt+i, ■ ■ ■ ,Vk}, the remaining vectors in W {wt+i, ■ ■ ■ ,w n -k}, 

and approximate copies of {u>i, . . . , wt}- This last set is given by: 



t 

w[ = c\Wx + duVi + euw, 

i=l i=l 



w' t = c t w t + ^2 d ti Vi + ^2 euWi 
i=i i=i 



In these sums we have da = en = 0, and we have orthonormality between these vectors. Consider 
the inner product x T £, where £ is of unit length and lies in the space orthogonal to {itj}: 



li- 



fe n— k t i i 

i=t+l £=t+l 1=1 i=l i=l 



— (^Dl 3 • • • 3 X Vt , X Vf+1 , . . • , iCufe ) Cw'^il i • • • ) ^ 1 Cw'^iti £v t +i i • • • 3 £u/s 

1=1 i=l 

t t 

+ (^UJl 3 ■ • • 3 ^U>t 5 "^Ult + l ) • • • 3 X W n -k ) (Cwj C l "I" ^ ] Cuii e «l 3 Cw[ C t + ^ ] (.w'^it, £,Wt + l 3 • • • 3 ClU n _ k ) 

i=l i=l 

The two vectors formed from £ have total norm 1. Now, we can apply Lemma to obtain: 

n\ m/2 / o\ to/2 



t / t 



t 



y=t+i i=i \i=i 
where the expectation term a v is given by: 



a w H 



u'=*+i i=l 



i=l 



= E 



, . . • , X Vt , X?; t+:L , . . . , 2;ti fc ) (y ^w'^H, ■ ■ ■ , ^ Cw'^iti (,v t+ i i ■ ■ ■ ) £,v k ) 



1=1 



1=1 



(and similarly for a w ). As in the single iteration case, we restrict to a curve. Fix an output £* of 
FindBasis: we will fix the ratio of the components {£ W j} m the ratio of £*, and similarly, we will 
fix the ratios of {£ w i , . . . , £ w i , £ Wt+1 , ■ ■ ■ ,(, Wn _ k } according to as well. This gives the following 
restriction on our curve after subtracting 7 m ||£|| m . 



5(0 = a v 



t / t 



j=i \i=i , 



m/2 



+ a w (z 



l\m 



where I is a constant given by: 
I 
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The coefficient of z' 2 is bounded by at most 2i(eJ //16 )', hence using the previous lemma for a single 
iteration, the output produced here is a (t + l) th vector ut+i such that: 



(^+i,n*)>l-(2t(el /16 )*) 1/8 



> i _ ( e V 16 )*+i 

for sufficiently small ei (relative to A;). □ 

4.4 Algorithms 

Using LocalOpt, we have an algorithm for factoring (Problem [1]) . To deal with the errors intro- 
duced by sampling and approximate local optima, we replace the Schwartz-Zippel step in Find- 
Basis with the robust version in Lemma [T2l where we set the error parameter of the robust 
Schwartz-Zippel lemma to be (rj — \\M m \\ 2 e)/n m . We will use the following robust version of the 
Schwartz-Zippel identity test. 

Lemma 12 (Robust Schwartz-Zippel). Let p be a degree m polynomial over n variables and K a 
convex body in W 1 . If there exists x G K such that \p{x)\ > e(2cn) m , then for I random points Si, 
Pr(y Si :\p( Si )\<e)<2- 1 . 

4.5 Robust Schwartz-Zippel lemma 

Proof of LemmaWA Let \x denote the uniform measure over K, by Corollary 2 of Carbery and 
Wright [12] : 

max\p(x)\ 1/m e- 1/m fi({x G K : \p{x)\ < e}) < cn 

Consider our I samples - there are two possibilities: 

1. fi({x G K : \p{x)\ < e}) > 1/2. In this case, we have \p{x)\ < e(2cn) from the bound above. 

2. fi({x G K : \p{x)\ < e}) < 1/2. Then, Pr {Mi \p(xi)\ < e) < 1/2*. 

□ 

We can of course amplify this probability by repeating the test (or simply taking I larger). 

Algorithm 5 FactorUnderGaussian 
Input: Highest moment m, distribution F. 

1: B <r- FindBasis(m, F). 

2: U ^— ExtendBasis(m, F, B, 4>). 

3: return U 



Proof of Theorem [7J We choose e\ (the first step local iteration) to be: 

(ei)(i6) fc < min{e,77- ||M m || 2 e} 
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where ||M m || 2 is the 2- norm (spectral norm) of the m th moment tensor. We take enough samples so 
that each estimated moment in W is within min(ei,r/ — ||M m || 2 )/n m ) of the Gaussian moment, and 
every moment in V is off by at most min(ex/2, rj/2). In particular, note that all sampled gradients 
and Hessian matrices take a value which differ by no more than e\/2 from their true values. Thus, 
we can simply absorb this as part of the error arising from local search. Also, this gives us an upper 
bound on sample complexity - the number of samples it takes to estimate the m th moments of a 
Gaussian distribution to accuracy e in M. n is given as C m e _2 n m//2 log n [22], which when evaluated 
becomes n olyrn \ 

At each iteration of the algorithm, we run the Robust Schwartz-Zippel test \og{k/8) times with 
Schwartz-Zippel parameter rj — ||M m || 2 e. With probability at least 1 — 5, either each iteration 
produces a u, where |E ((x T u) m ) — 7m] > V ~ II M m || 2 e or we correctly deduce that there are no 
more directions whose moments differ from a Gaussian by more than (77 — ||M m || 2 e)/n m . In the 
latter case, by moment distinguishability, every vector in P, the minimally relevant subspace, has 
large projection on the basis {ui}. 

In the former case, we know that every unit vector in {ui} 1 - with projection at least 1 — e 
takes value which is bounded away from j m by at least 77 — ||M m || 2 e, thus every such vector is still 
moment distinguishable. Applying Theorem [6] then, we sequentially generate a sequence of at most 
k orthogonal Uj such that: 

|(^,7T V (n J ))|>l-(e 1 )( 1 / 16 ) 1 

We need to show that in addition d m (F, FuFu±) < e. Let F' = FuFu±: the moment-distance 
between the true and sampled distributions differ by at most e\, it suffices for us to prove that 
d m (F,F') < e. To this end, we will apply the representation formula to F' for some fixed unit 
vector u. As before, we have: 

E F , ((x T u) m ) = E F , {{x T uu) m )) + E F , ((a?u v ±) m ) - lm \\ Uu \r - lm \\u v ± || m + 7m 
= E F {(x T uu) m )) + E F {(x T Uu ^ m ) - lm \\ Uu \\ m - lm Hn^x || m + lm 

Hence, comparing with a similar expression for E F ((x r it) m ): 

\E F , {(x T u) m ) - E F {(x T u) m ) I < \E F ((x T Uu ) m )) - E F ((x T u v ) m )) \ + 

+ \E F , ((iVD-EF {{ X T UyS.) m )\ 

I II II Tf~l || || 7TI I III II TYl || Ml 

+ |7m||«C/|| - ||«V|| I + lm \\\UU\\ -||«[/-L||| 

Now, viewing E F ((x T n) m ) as the tensor applied to u, we see that we can bound these terms by 
the tensor spectral norm: 

\E F > {(x T u) m ) - E F {(x T u) m ) I < (||M m || 2 + m lm ) \\u v -u v \\ + {\\M m \\ 2 + m 7m ) \\u v ± - u v ± || 

By choice of U, we have \\ujj — uy\\ < e, and similarly for the othogonal component, thus we have 
our bound. □ 

We now apply these methods to learning the concept class T~L (Problem [2|) . After applying an 
isotropic transformation, F will have Gaussian moments in every direction orthogonal to V, and 
hence the output basis of FindBasis and ExtendBasis returns only vectors in the V subspace. 

The proof of this algorithm is straightforward given the proof of the factoring algorithm under 
Gaussian noise and our robustness assumptions. We will use the following proposition on robust 
learnability (see e.g., [2]). 
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Algorithm 6 LearnUnder Gaussian 



Input: Highest moment m, distribution F. 
1: Bi <— FindBasis(m, F). 

2: B2 <— FindBasis(m, F + ) on the space orthogonal to B\. 

3: Alternately run ExtendBasis on F and F + to find a basis U of size at most k. Extend this 

to a basis of dimension k. 
4: Draw sufficient samples S to learn T~L on this k dimensional subspace. Project S to span (U). 
5: Learn T~L over U. 



Proposition 7 (VC dimension). Let % be a hypothesis class with VC dimension d. Let t G % be 
a subspace junta with relevant subspace V , where dim(V) = k. Let U be a k dimensional subspace 
where 1(ttu) labels a 1 — e fraction of points correctly. Then we can learn I with sample complexity 
( 1 / e ) C2 diog(i/e)+ C2 iog(2/5) wUh probabi i it y at 1-8. 

Proof of Theorem^ T~L is robust; by assumption there exists e' which is polynomial in e such that 
e' + g(e') < e/2. We will take this e' and will use the following ei for all our calls to LocalOpt: 

(€ X )(h)* < mm{e\n - ||M m || 2 e} 

Under these parameters, the proof for the factoring steps of Lines 1-3 are as in FactorUnder- 
Gaussian. Thus with probability at least 1 — 5 we will obtain an orthonormal basis {ui} where 
|(n l ,7r y (n J ))|>l-(e 1 )( 1 /i6)\ 

By moment learnability, the set of {u{\ discovered is approximately a basis for P, the minimal 
dimension relevant subspace. By our choice of ei above, we satisfy the robustness condition, i.e., 
^i 6 < e'i in which case only e/2 fraction of the points are mislabeled over span({itj}). Finally, 
we allow the remaining e/2 error to the learning algorithm, to obtain an output hypothesis which 
correctly labels 1 — e fraction of F. □ 



5 Moment distance 

In our algorithms, we terminate if all remaining directions are Gaussian in the m th moment (for 
some fixed m). We would like a guarantee that when we do this, that the random variable is in 
fact very close to being Gaussian. What follows is a set of results which quantify this idea. We first 
restrict ourselves to one random variable to introduce the analytic tools we need. In what follows, 
we use the following normalisation for our fourier transforms in R n : 



/(£) 



'-'fi.rUl.r 



This implies that the Parseval/Plancherel theorem takes the following form: 



1/0*01 dx 



(2vr)" 



/(O 



for / e L 2 (R n ). 

The core of the proof is the following statement, whose proof employs Fourier analytic tech- 
niques. We need the following standard theorem on characteristic functions (see for example |38j): 
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Theorem 8 (Characteristic functions). Let £ be a random variable with distribution function F = 
F(x) and 4>{t) = E (e**») its characteristic function. Let E(|£| n ) < oo for some n > 1, t/ien $>( r ) 
exists for all r < n and 

Moreover, we have an expression for the derivatives at 0: 

And finally we have the following Taylor series estimate with error: 

Z — ✓ r ! n \ 

r=0 

where the error term e n {t) — > as n — > oo and is bounded: 

|e n (t)|<3E(|e| n ) 

Now: 

Lemma 13 (L 2 distance from a Gaussian). Lei / G L 2 (M) 6e a probability density over R whose 
first m moments match those of a standard Gaussian (whose probability density we will denote g). 

Suppose that the Fourier transform f satisifies a tail bound that 

then: 



/(£) < c / l£l / or some c > 0, 



\f(x)-g(x)\ dx < — ^ 
Proof. We will assume for the sake of simplicity that m is even. By Parseval's formula, we have: 

,2 - 1 



\f(x)-g(x)\ z dx 



Both / and g have tail bounds: / by hypothesis, and g because the Fourier transform of a Gaussian 
is still a Gaussian. Thus if we truncate the tails in an interval [— L,L] where L = m 1 / 8 : 



/ 



-L,L] 



/(0 



dt<2j ±d£ 
4 

< — 
- L 



The Fourier transform of a Gaussian is a Gaussian, and by applying a standard Gaussian tail bound 
[18|: 



1 



e- t2 / 2 dt < 



1\ e^ 2 / 2 
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We can then combine these estimates using the triangle inequality: 



/ 

Jr, 



-L.L] 



R/[-L,L] 



/(£) +\g®\ 2 dt 



6 

< — 
- L 



In the interval [— L, L], we now apply Theorem 



k=0 



nil 



■/??,! 



Now we can bound the integral: 



\(f - gym 2 da < 



< 6 



< 



mi 



E(x m ) 



ml 



2 r-L 



6 



t 2m dt 

-L 

2L 2m+l 



(2 m /2( m /2)!)2 2m + 1 

< exp ( (2m + 1) log(L) — m log(2) — m log ( — ) + m 

2m + 1 V V 



c 

< — e" 
m 



□ 



We can also give an approximate version of this theorem: 

Lemma 14 (Approximate moments). Fix e > 0, let f 6 L 2 (R) 6e a probability density over 
whose first m moments satisfy: 



Ik 



< e 



Suppose that the Fourier transform f satisifies a tail bound that 
(for a standard Gaussian g); 



/(£) < c/ |£| for some c > ; then 



[ \f(x)-g(x)\ 2 dx<^- + c"m 2 e 2 e m 



Proof. We proceed as in the previous lemma. It suffices for us to bound the integral over the 
interval [—L,L]. We apply Cauchy-Schwarz for a termwise estimate. 

E/(**)-M**),..„* . 



fc=0 
L m 



A:! 



-(*)" + (*/(*) -*,(*)) : 



m! 



da 



fc=o 



fc! 



m! 
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We can now partition the moments into powers of 2, so consider the moments where k £ [m/2* +2 , m/2 1 ]: 
the integral of each contributing term is now: 

\ 2 



—L 



% (s*) ~ E g (a 
k\ 



exp 



< 2 (s 

< 2 (E/ (**) - E 
<2(E / (^)-E 3 (^)) 2 exp(^ T ) 



2(E/ (x fc ) -E 9 (x fc )) 2 L 2fc+1 
{2k + l)k\ 

log(m) — k log k + k 



2k + 1 



exp 



(m/^) + 2, 777 n 

1 log(m)-^log 



/ 777 \ 777 \ 
12^+2/ + ¥J 



□ 



Both of our lemmas so far in this section use a tail bound for the Fourier transform. One way 
to obtain such a tail-bound is to examine logconcave probability densities: 



Lemma 15 (Log-concave densities). Let f : 



be a logconcave density which is isotropic and 



differentiable, then 



no <2/iei. 



Proof. We start by bounding the magnitude of the Fourier transform by the integral of the deriva- 
tive. 



f(0 = [ e* x f(x)dx 

JR 



1 d 

i£ dx 

—e' — 

i£ dx 



t c i£x d f( x 



e^ x f(x)dx 
dx 



where the third line follows by integration by parts and noting that in the limit f{x) — > as 
x — > ±oo. This allows us to bound /(£) : 



f(0 



< 



dx 



Let us now turn to logconcave densities. Since / is logconcave, we can write it as f(x) = e h ^ 
where h is concave. Because / is a probability density, we must have h(x) — > — oo as x — > ±oo, 
in which case since h is concave there exists a unique interval [a, b] where h(x) takes a maximum. 
This fully determintes the sign of the derivative: h'(x) = in this interval h'(x) < for x < a 
and h'(x) > for x > b. The same signs pattern holds for /', as multiplication by e~ h ^ does 
not change the sign. We can now compute the integral by applying the fundamental theorem of 
calculus: 

p ra pb poo 

/ \f(x)\dx= / f(x)dx+ / f(x)dx+ / -f'(x)dx 

JR J — oo Ja Jb 

= lim (/(a) - f(-t)) + (f(b) - /(a)) + (-/(t) + f(b)) 

t— >oo 

= f(a) + f(b) 
= 2/(a) 
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We now apply the following lemma [31], which yields the desired result. 

Lemma 16 (Upper bound on logconcave functions). Let f be an isotropic logconcave density in 
one dimension, then \f(x)\ < 1. 



□ 



Then corollary to Lemma [T3l 



Corollary 17 (L 2 distance for logconcave densities). Let f : M. — > R be an isotropic logconcave 
density whose first m moments match a Gaussian g, then: 

\\f-g\\< ' 



m 



1/8 



Proof. First, consider the case when f(x) is differentiable. We already know that / G L 1 (IR); since 
f(x) is bounded by 1 (Lemma [To]) , then we have that f(x) G L 2 (M) because f(x) 2 < \f(x)\. We 
can now apply Theorem 1131 with the tail bound guaranteed by Lemma [T5l 

For the case when f(x) is not differentiable, we can perturb by a small Gaussian random 
variable: let X ~ /, and let Z ~ N(0, 1) be an independent normal variable. Fix a parameter 
TG [0,1]: 



Y T = (1 - r)X + ^2r + r 2 Z 

is isotropic. Moreover, since this the sum of two independent logconcave random variables, its 
density is also logconcave. Let h\ denote the density of (1 — t)X and /12 the density of V2r + t 2 Z, 
then the density of our new random variable is given by: 

/oo 
hi(x — t)h,2(t)dt 
-00 

The convolution of these two distributions is (infinitely) differentiable because /12 is (infinitely) 
differentiable: 

i {hi * ) = {i hi )* h2 = hi *{i h2 

Thus Y T satisfies the hypotheses of Lemma [T5l and we have a tail bound for Y T as long as r > 0. 

The first m moments of Y are also close to those of X: if we compute the j th moment for 
example: 

e (y/) = e n (1 - t)x + V2r + T 2 zy^j 

= (1 - r) J E (X j ) + ^ ( l \ (1 - T) j (y/2T + T 2 y- j E {X L ) E (Z^) 
i=i 

Thus we can pick r small enough so that: 

|E(Y r J ) - E(X j ) \ < e 
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for any e > 0. In the proof of Lemma [13] then, instead of the moment differences from the first 
m terms of the characteristic function being 0, we can make them arbitrarily small by choosing 
smaller r. Thus we have the conclusion of Lemma [T3l for Z. To conclude, we note that: 

lim \\h\ * h,2 — f\\ 2 = 
in which case, taking r small enough allows us to apply the triangle inequality to: 

||/-<7|| < \\f-h\\ + \\h-g\\ 

□ 

We also need a lemma to convert our I? estimates to L 1 estimates. This is not general in 
possible, but since logconcave functions have exponential tailbounds: 

Lemma 18 (L 2 to L 1 ). Let f, g : R — > R isotropic logconcave densities such that for some m > 
that: 



[ 1/0*0 - g(x)\ 2 dx < — 
J m 



then: 

clog(m) 



\f(x) - g(x)\ dx < 



m 



for some absolute constant c > 0. 

Proof. Fix L = (-) log(m), then as before: 

/ 1/0*0 - 00*01 dx = / \f(x)-g(x)\dx+ \f{x) - g(x)\dx 

J J\ X \< L J\x\>L 

We can now use tail bound for logconcave functions over the tail [21], in particular, for isotropic 
logconcave random variables X in R n , we have (for some fixed absolute constants c, C > 0: 



Pr (|||x|| — y/n\ > t^/n) < Cexp ^— era min(t,t 3 



In one dimension, this shows that the integral of our tail is bounded by C/m (after application of 
triangle inequality). Now inside the interval [— L, L], we will apply the Cauchy-Schwartz inequality: 



L 



\f(.r)-<;(:r)\(Lr< ( / *0l"^l ( f 1^1 



[-L,L] \J[-L,L] ) \J[-L,L] 

< — — log(m) 

cJm 



□ 

Proof of Theorem d The proof follows from Lemma [T4[ Corollary [T7] and Lemma [T8| noting that 
the the technique of Corollary 1171 can be applied to Lemma [Til in the same way as Lemma □ 
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6 Applications 



In this section, we give some applications of our general theorems and we some explicit bounds 
for moment-learnable triples and the running time of our algorithms on these triples. We make 
explicit in our analysis the three key contributions to runtime - how many moments are required, 
how efficiently these moments can be sampled, and how efficiently the hypothesis can be learned 
in the /c-dimensional relevant subspace. 

6.1 Moment estimation 

In this section, we highlight some further consequences and subtleties of using moments in algo- 
rithms. The use of moments is a very natural way of studying random variables. For example, the 
inequalities of Markov, Chebyshev and Chernoff are statements about the relationship between a 
finite sequence of moments and the tail of a distribution. If we consider an infinite sequence of 
moments, often these will determine the distribution uniquely (the moments problem). 

One of the critical terms in the runtime given in our main theorems is C_F(m, e): the sample 
complexity of approximating the m th moment tensor of distribution F to within accuracy (in 
the moment metric above). The competitiveness of our algorithm with other learning algorithms 
depends on the number of moments we need (ie the previous section), and the number of samples we 
need to attain the required accuracy. This latter problem is well-studied, and there is an impressive 
body of literature surrounding it. In particular, when m = 2, the problem is of interest to random 
matrices community, who have provided strong bounds in a number of important cases. We will 
provide a brief overview of these results, but this by no means is intended to be a comprehensive 
survey of the literature! When the distribution F is isotropic and almost surely supported in a 
ball of radius 0(y/n), Rudelson [35] gave a very strong bound on Cf(ti, e) to achieve the following 
guarantee: 




Rudelson required only 0(nlog(n)) samples when F is almost surely supported on a ball of radius 
0(y/n), and where the constant is dependent on e. Adamczak et al. [1] were able to improve this 
bound of O(n) samples. Their assumptions were support on a ball of radius 0{^fn) as before, and 
a subexponential moment condition: 

sup E((x T v) p ) 1/p = 0{p) 

\\v\\ = l 

As an application, they showed that logconcave distributions satisfy these assumptions, and thus 
their covariance matrices can be sampled very efficiently. Subsequent work by Vershynin and collab- 
orators [39, 44 J has broadened the class of efficiently samplable covariance matrices to distributions 
where 2 + e moments exist and also to distributions where the m th moment is bounded by K m for 
some constant K. 

Finally, in the setting of higher moments, there is the result of Guedon and Rudelson [22J, 
which gives the sample complexity of sampling for higher moments of logconcave distributions. 
Their result is that 0(n m / 2 log(n)) samples are necessary to approximate moments in all directions 
up to an 1 + e factor. In particular, this leads to the observation that explicitly computing a 
sample moment tensor from n m / 2 samples is actually less efficient than simply storing the points, 
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computing the inner products to the appropriate powers and summing. This last result is used 
in our applications in Section [6j as it allows us to handle many distributions efficiently, including 
Gaussians and uniform distributions over convex bodies. 

6.2 Robust learning 

For learning over a /c-dimensional subspace, we have the following proposition: 

Proposition 9 (VC dimension). Let % be a hypothesis class with VC dimension d. Let t £ % be 

a subspace junta with relevant subspace V , where dim(V) = k. Let U be a k dimensional subspace 

where £(nu) labels a 1 — e fraction of points correctly. Then we can learn I with sample complexity 
( 1 / e ) C2 diog(i/ e )+c 2 iog(2/5) wUh probabi i it y a t l -3. 

Proof. To come up with a hypothesis over U, we take a new set of samples S of size m and project 
them onto U. By robustness of T~L under F, we know that Pr (£(iru(x)) = £(x)) > 1 — e. Then we 
guess the correct labels by trying all relabelings of subsets of size em. One of these relabelings 
will give us a labeling consistent with £ viewed as a function of the fe-coordinates in U. For each 
relabeling we attempt to learn the labeling function. On the correct relabeling, we can learn £ to 
with at most e fraction of errors. By the theorem above, our total error over M n is 2e. 

To bound m, we apply an idea from [9] via a slight extension (Theorem 5 of [2]). The required 
bound is m > (32/e) log (C[m]) + (32/e) log (2/5) where C[m] is the maximum number of distinct 
labelings obtainable using concepts in H over M fc . In particular, we have C[m] < Yli=o (T)> whence 
C[m] < m d . A computation reveals that m > c(d/e) log(l/e) + (c'/e) log(2/<5) suffices. The number 
of relabelings is ( e ™), which is upper bounded by (m/e) me < (i/ e )C2diog(i/e)+c 2 iog(2/<5)_ Q 

As mentioned previously, we can view the work of |42] as a specialization of our algorithms to 
the j = m = 2 case in FindBasis. We give examples here where the second moment does not 
suffice, and we must use higher moments to resolve the relevant subspace V. Our examples are: 
([T|) hyperrectangles (cuboids) in balls, ([2]) subsets of balls, and ([3]) concepts which have compact 
support. In all our examples, the algorithm used is LearnUnderGaussian. We will prove that we 
can find the relevant subspaces by running FindBasis on either the full distribution or distribution 
conditioned on positive labels (the "positive" distribution). 

We use the uniform distribution over a ball in M fc in the relevant subspace. We need the following 
elementary fact. 

Claim 19 (Isotropic balls). Let F be the uniform distribution (with density p) over B#(0) C M. n 
where R = \Jn + 2, then E ((x T u) 2 ) = 1 for any unit vector u. 

By a hyperrectangle, we refer to a region of space which is the Cartesian product of closed 
intervals i.e. S = [oj, b{] x • • • x [a*., bj.] C M. k : 

Application 1 (Hyperrectangles in balls). Let F = FyFyy where Fy is a uniform distribution 
over a ball B and k = dim(V), Fw is any Gaussian over n — k dimensions. Let S C B denote 
a (hyper)rectangle in V . Take the hypothesis class T~L = {(xs( 7r y))( a; ) '■ S C M} to be the set of 
functions which assigns positive labels to points whose projection to V lies in the interior of rectangle 
S. 

Proposition 10. The triple (k,F,7i) as defined in Application [I] is (4, 6/ (5k)) moment-learnable 
with time and sample complexity poly(fc, 1/e) + Cfc j)E n 2 . 
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Proof. Without loss of generality, we may assume that IB = ^^+2(0) after isotropic transformation, 
and that the Gaussian over Fy/ is a standard n-dimensional Gaussian. Furthermore, we may assume 
that S is centered on the origin as well (i.e. we apply Lemma [5] to the positively labeled points). 

Suppose we now run LearnUnderGaussian on the positively labeled samples. We start with 
the second moment (r = 2) in our algorithm FindBasis: the second moments of a uniform distri- 
bution over a rectangle are fully determined by the second moments along the axes of the rectangle. 
In particular, FindBasis using the second moments will simply give us every axis of the rectangle 
where the second moment is not 1. A simple calculation of the moments of a uniform distribution 
over a rectangle along axis Xi where the rectangle has length 2Si gives: 



Thus, using the second moment will give us all the axes of our hyperrectangle except where the 
rectangle has length 2Si = 2\/3. Projecting orthogonally to these axes, we now consider the third 
moments (r = 3): the third moment of our uniform rectangle is clearly in every direction by 
symmetry of the rectangle. Thus, we turn to the fourth moment - note that fixing S{ = \/3 fixes 
the fourth moment along each axis of the rectangle, in particular: 



Unfortunately, the equality of the fourth moment along the axes of a rectangle does not necessarily 
imply the same fourth moment in every direction. However, iterating Lemma [3] allows us to bound 
the fourth moments away from the fourth moment of a Gaussian 74 = 3: 



Thus, we have our moment learnability using only the fourth moment! Now that we have the 
relevant subspace V, we can simply learn our rectangle in a dimension k space, which takes poly(fe) 
time. Moreover, note that since all the distributions are logconcave, we can apply the moment 
sampling results of Guedeon and Rudelson mentioned in Section 16.11 - in particular, we can take 
the number of samples required to be Cj?(m, e) = C t n 2 . Thus this gives a final runtime of poly(A;) + 
Ck y€ n 2 where C^g. □ 

The key point here is that we have very low polynomial dependence in n. This conforms well 
with our model where we think of k as being small compared to n. We can, in fact, prove a stronger 
result — we can always find the relevant subspace if Fy is a uniform distribution over a ball: 

Application 2 (Uniform distributions over balls). Let F = FyFy/ where Fy is a uniform distri- 
bution over a ball B and k = dim(V), Fy/ is a Gaussian. Let H be a robust hypothesis class which 
we can learn with complexity bounded by T(k,e). 






where the sum is taken over directions corresponding to axes where Si = \/3. Now by applying the 
Lagrangian style techniques of Lemma HI we can bound this by: 



E((x T n) 4 ) < 7 4 



6 

5k 
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Proposition 11. The triple (k,F,H) as defined in Application^ is (4,^(1)) moment-learnable, 
with the time and sample complexity bounded by T(k,e) + Cfc i£ n 2 . 

Proof. We will examine what happens when we run FindBasis on the full distribution (as opposed 
to the positive distribution in the previous example). We compute the fourth moment of a ball of 
radius R = \/n + 2. For simplicity, we will assume that k = 21 + 1 for some positive integer I ie k 
is odd: 



E (xf) = / xfpdx 
R r 

x\pdx2 ■ ■ ■ dxkdxi 



R Jm J. (o) 



1 



R 

x\ vol ( M k ~\ (Q) ) dX! 



vol (B£(0)) J-r 1 V v^^f 



vol(Bi _1 (0)) rR / T 2xi 



vd(B*(o)) y-^l 1 #J dxi 

We first examine the volume ratio: using the recurrence: 

vol (B|(0)) = vol (l^ 2 (0) 

and unrolling the recurrence, we have: 

vol (2/) __ (2Z + 1)!! 
vol (21 + 1) ~ 2i?(2/)!! 

1 (2/ + 2)!! 



2R (jt + l)&+H& 
1 (2/ + 2)!! 
2R(l + l)\l\2 2l + l 



Applying Stirling's approximation, we have: 



vol (2/) 1 ^2^(2/ + 2) 1 /2/ + 2y' +i /e 



2«+2 , / \ (Z+l) 



vol(2Z + l) 2i? 2vr v / Z(Z + 1) 2 2i +! V e y V Z / + 1 

a+i) 



1 1 2 // + l x ' 
2~R^fe 
1 Z + l 




1 



z(/ + 2) y 

Returning to the integrand, we can simplify it somewhat: 



i - ^ ) 
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By explicitly taking the integral (using a computer algebra system), we have: 



3^(21 + 3) 5 / 2 r(z + 1) 

dx\ = — 



ST (I + 7/2) 

where T here is the usual gamma function. The behavior of this function is as follows: 

3^(21 + 3) 5 / 2 r(Z + 1) = R 
l— too 8T(l + 7/2) V 2 

Moreover, the function is monotonic increasing for I > 0, and takes on the value 56v7 /45 at / = 2. 
Thus, combining these facts with the estimate of the volume ratios, we can see that the fourth 
moment of a ball is bounded away from the fourth moment of a standard Gaussian by a constant, 
hence we can take rj = $7(1). Once we have the relevant subspace V, we can project the samples to 
V and learn in time T(k, e). The runtime in this case is T(k, e) + C/c )(E n 2 . □ 

As a specialization, when the positive examples are determined by a convex subset of the unit 
ball, T(k,e) < (k/e)°^ k \ In a £;-dimensional subspace, we can learn a convex subset of the ball by 
simply taking the convex hull of (k/e)°^ random positive points. From the classical approximation 
theory of convex bodies [34] , we obtain an approximation to the true convex body to within relative 
error e, giving total runtime (k/e)°^ + Cfc i£ n 2 . This complements [32] which provides a PCA- 
based algorithm for learning convex bodies when the distribution in the relevant subspace is also 
Gaussian. In that paper, it is mentioned that standard PCA fails if the full distributions is not a 
Gaussian. 

We now present an example that relies on boundedness - either of the full distribution in the 
relevant subspace, or the positive distribution. This rather general result uses relatively many 
moments. 

Application 3 (Compact distribution in relevant subspace). Let F = FyF\y where Fyy is any 
Gaussian over n — k dimensions. Take % to be a robust hypothesis class learnable with complexity 
T(k,e). Assume that either Fy or% has its support contained in B g ^(0). 

Proposition 12. The triple (k,F,T-L) described in Application^ is (g(k),Q(l)) moment-learnable 
with complexity T(k,e) + Ck^rP^ 9 ^ \ 

Proof. Suppose we run FindBasis on the full distribution or the positive distribution, whichever 
is contained in a ball of radius g(k). Consider the relevant subspace. If we fix some even moment 
m then we can give explicit bounds on the moments: 

E((x t ) m )<g(k) m . 

On the other hand, the even moments of a Gaussian are given by (m — 1)!! = m!/(m/2)!2 m//2 which 
grows much more rapidly. If we take logarithms on both sides, then we can find m = m{k) such 
that: 



m log(g(fc)) < log 



ml 



[m/2)\2 m / 2 
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Applying Stirling's approximation yields: 



— log(g(k) ) < m log(m) - m - — log i^—j + — - y log(2) 

m m 
< — log(m) - — 

So if we pick m = 2g(k) 2 , then the difference in the moments should be Thus, simply running 

FindBasis on the full distribution will allow us to recover the relevant subspace, at which point we 
can learn T-L in R. h (doable in time T(k)). It remains to prove that we can sample the first 2g(k) 2 
moments of a bounded distribution efficiently: since it is bounded, all moments exist. In particular, 
if we require 2g(k) 2 moments, then the 4g(k) 2 moment is bounded by g(k) i9 ^ . Then by applying 
Chebyshev's inequality, we see that we need at most g(k) 0( - 9 ^ ) samples in the relevant subspace. 
The overall runtime for this algorithm is then T(k, e) + C^^n ^ 9 ^ ). □ 
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